Drain per-host reservation when a VM starts on a different host#13363
Open
Kukunin wants to merge 2 commits into
Open
Drain per-host reservation when a VM starts on a different host#13363Kukunin wants to merge 2 commits into
Kukunin wants to merge 2 commits into
Conversation
When a VM is stopped via the API, postStateTransitionEvent moves its used capacity into reserved_capacity on its last host (so a quick restart can reclaim it cheaply). Until now, this reservation was only drained when the VM started again on the same lastHostId. Starting on any other host left an orphan reservation that lingered for up to one hour (capacity.skipcounting.hours) and was summed into the cluster used+reserved aggregate by FirstFitPlanner.removeClustersCrossingThreshold, spuriously tripping the disable-threshold and blocking later starts. Always drain the VM's reservation on its previous host before allocating on the target host — same host or not. This removes the fromLastHost branch (now dead, the only other caller already passes false), the matching boolean parameter on allocateVmCapacity, and the misspelt moveToReservered parameter everywhere it appears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Jun 6, 2026
Contributor
|
@blueorangutan package |
|
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a capacity-accounting asymmetry where per-host reserved CPU/RAM left behind on a VM’s previous host could linger after the VM starts on a different host, inflating used+reserved and potentially blocking subsequent placements at the cluster threshold.
Changes:
- Drain the VM’s reserved capacity on its
last_host_idduring relevant state transitions before allocating capacity on the target host. - Simplify capacity allocation by removing the now-unreachable
fromLastHostpath and updating callers accordingly. - Add unit tests to ensure reservations are released for both same-host and different-host starts; also standardize logging and fix a long-standing
moveToReserveredtypo.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| server/src/main/java/com/cloud/capacity/CapacityManagerImpl.java | Unifies reservation draining behavior on start/migrate transitions; removes fromLastHost logic; adjusts logging/typo. |
| engine/components-api/src/main/java/com/cloud/capacity/CapacityManager.java | Updates the public API to remove the allocateVmCapacity(..., fromLastHost) parameter and fixes the moveToReservered typo. |
| engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java | Updates call site to the new allocateVmCapacity(vm) signature. |
| server/src/test/java/com/cloud/capacity/CapacityManagerImplTest.java | Adds regression tests covering reservation draining on same-host and different-host starts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
174
to
+178
| HostVO host = _hostDao.findById(hostId); | ||
| if (HypervisorType.External.equals(host.getHypervisorType())) { | ||
| return true; | ||
| } | ||
| return releaseVmCapacity(vm, moveFromReserved, moveToReservered, host); | ||
| return releaseVmCapacity(vm, moveFromReserved, moveToReserved, host); |
Comment on lines
942
to
944
| Host lastHost = _hostDao.findById(vm.getLastHostId()); | ||
| Host oldHost = _hostDao.findById(oldHostId); | ||
| Host newHost = _hostDao.findById(vm.getHostId()); |
Comment on lines
985
to
990
| if ((newState == State.Starting || newState == State.Migrating || event == Event.AgentReportMigrated) && vm.getHostId() != null) { | ||
| boolean fromLastHost = false; | ||
| if (vm.getHostId().equals(vm.getLastHostId())) { | ||
| logger.debug("VM starting again on the last host it was stopped on"); | ||
| fromLastHost = true; | ||
| if (vm.getLastHostId() != null) { | ||
| releaseVmCapacity(vm, true, false, vm.getLastHostId()); | ||
| } | ||
| allocateVmCapacity(vm, fromLastHost); | ||
| allocateVmCapacity(vm); | ||
| } |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18263 |
Two defensive fixes from Copilot's review of PR apache#13363: 1. releaseVmCapacity(VM, ..., Long hostId): _hostDao.findById(hostId) can return null when last_host_id points to a deleted host. The subsequent host.getHypervisorType() check would NPE. The Host overload already null-checks AND has the External check, so delegate to it instead of duplicating the External check here. 2. postStateTransitionEvent: the lastHost was being fetched twice (once for logging at L942, again inside releaseVmCapacity via the Long overload at L987). Reuse the already-resolved Host. The Host overload also handles null, so the explicit null check on getLastHostId() becomes redundant. (Skipped Copilot's third suggestion to guard findById(null) — the GenericDaoBase.findById(ID, boolean, Boolean) implementation at GenericDaoBase.java:1066 explicitly returns null for null id, so the existing code is already safe.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What's happening
When a VM is stopped via the API, CloudStack moves its CPU/RAM from
usedtoreservedon itslast_host_id(see theStopping → Stopped + OperationSucceededbranch inCapacityManagerImpl.postStateTransitionEvent). The idea is that a quick restart on the same host can reclaim its earmarked slot cheaply.The asymmetry: when the VM later starts on the same host, the
fromLastHost=truebranch inallocateVmCapacitydrains that reservation. When it starts on a different host, the old reservation just sits there. It only clears when:capacity.skipcounting.hours(default 1h) elapses andupdateCapacityForHostrecycles it, orUntil then, the orphan reservation gets summed into the cluster's
used + reservedaggregate byFirstFitPlanner.removeClustersCrossingThreshold. On a cluster that's already nearcluster.memory.allocated.capacity.disablethreshold(default 0.85), the phantom can trip the threshold and block subsequent VM starts in the whole cluster — even though the VM in question isn't actually consuming anything on its old host.We hit this on a fairly full cluster where stop-then-start cycles started failing intermittently with
InsufficientServerCapacityException: No destination found. The "ghost" capacity was the released-but-not-drained reservation from the last stop.The fix
postStateTransitionEventnow drains the VM's reservation on its previous host before allocating on the target host — regardless of whether the target is the same host or a different one. Treating both cases identically removes thefromLastHostasymmetry.Side effects of unifying the path:
fromLastHost=truebranch inallocateVmCapacityis now unreachable frompostStateTransitionEvent. The only other caller (VirtualMachineManagerImpl#reconfiguringOnExistingHost) already passesfalse, so the parameter is removed entirely.moveToReserveredtypo (3 e's) — renamed tomoveToReservedthroughout the interface, the impl, and the debug logs.logger.debug(String.format(...))inpostStateTransitionEventswitched to SLF4J{}placeholders to match the rest of the file.Tests
Two new tests in
CapacityManagerImplTestcover both transitions:testPostStateTransitionReleasesStaleReservationWhenStartingOnDifferentHost— fails onmain, passes with the fix. VerifiesreleaseVmCapacity(vm, true, false, oldHostId)is invoked.testPostStateTransitionReleasesReservationWhenStartingOnSameHost— guards the unified contract so both paths stay in lockstep.Manual validation
Reproduced on a small test cluster:
reserved=128Mappears on A inop_host_capacity).hostid.release mem from host: A, old reserved: 128MB → new reserved: 0.Runningon B,last_host_idupdates to B andupdateCapacityForHostagrees the reservation is gone.Same-host stop/start round-trip continues to behave the same way as before.